We are constantly consuming multiple genres and subgenres of music each day. Developing an understanding of this music is something widely underappreciated and accomplished, however, the programmers at spotify deem it a necessary feature that was to be integrated into their platform and business model. In doing this, they created an API open for developers to integrate their data into applications and analytics. Other companies like Genius and LastFM have created APIs as well to both bring lyrics and genre tags respectively into the developer’s hands.
In this work is a preliminary investigation of the past decade of music taken from Billboard’s Top 100 songs. Using the Spotify API, these songs, and their related albums have been pulled along with their track features that explain the mood, feeling, or characteristics of tracks. This is a fairly vast dataset of songs and song features that open up a lot of possibility for analysis. Some questions that could be asked are: How does music happiness change across the world? How does happiness relate to other key features in music? Can we form a model that shows this relationship? This and further exporations found at this repo will explore this and more.
#plotting and wrangling
library(tidyverse)
library(highcharter)
library(factoextra)
library(patchwork)
library(sf)
#ml/modeling
library(caret)
#working with time variables
library(lubridate)
#setting the seed for reproducability
set.seed(543)
#data is just imported here. Script for preprocessing and other wrangling and
#imports from the Spotify API located in this folder under
#Spotify_get_script.R
billboard <- read_csv("../data/spotify_full_100_10s.csv")
energy and valence from the last decade?billboard %>%
distinct(id, .keep_all = TRUE) %>%
group_by(album_name) %>%
hchart(type = "scatter", hcaes(energy, valence, group = album_name)) %>%
hc_legend(verticalAlign = "left", layout = "horizontal", x = 30, y = 15, fontSize = "10px") %>%
hc_title(text = "Happiness vs. Energeticness of the Top 50 Albums", align = "center")
#Getting the country codes
countries <- read_csv("../data/country_codes.csv")
#Converting to a shape Object
new_data <- billboard %>%
left_join(y = countries, by = "country") %>%
group_by(album_name, country) %>%
summarise_if(is.numeric, mean) %>%
ungroup() %>%
st_as_sf(coords = c("Latitude", "Longitude"))
#Plotting countries in the world
hcmap(map = "custom/world-robinson-lowres",
data = new_data,
name = "Happiness of Music (Valence)",
value = "valence",
borderWidth = 0,
nullColor = "#d3d3d3") %>%
hc_colorAxis(
stops = color_stops(colors = viridisLite::magma(10, begin = 0.1)),
type = "logarithmic"
) %>%
hc_title(text = "Happiness of the Billboard Top 100 Albums Available Around the World")
From the Spotify Dataset, there are a few parameters that are computed as functions of other parameters found in the data. These parameters are energy, loudness, and acousticness. For example, loudness is used in the calculations of energy. For this reason, these terms will be interaction terms in the regression model.
#selecting all of the numeric attributes for a linear model
model_ready <- billboard %>%
distinct(id, .keep_all = TRUE) %>%
select_if(is.numeric)
# Setting up cross validation
ctrl <- trainControl(method = "cv", number = 10)
# trying to model the relationship between happiness and the other variables
lm_spotify <- train(valence ~ danceability + energy * loudness * acousticness + speechiness +
tempo + instrumentalness + liveness + key, data = model_ready,
method = "lm",
preProc = c("scale", "center"),
trControl = ctrl)
summary(lm_spotify$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51329 -0.12136 -0.01593 0.11517 0.62896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.389355 0.005472 71.160 < 2e-16 ***
## danceability 0.092034 0.006139 14.993 < 2e-16 ***
## energy 0.147593 0.025639 5.757 1.11e-08 ***
## loudness -0.058786 0.038054 -1.545 0.1227
## acousticness 0.102364 0.040714 2.514 0.0121 *
## speechiness 0.023024 0.005889 3.909 9.81e-05 ***
## tempo 0.003168 0.005593 0.566 0.5712
## instrumentalness -0.011792 0.006279 -1.878 0.0606 .
## liveness 0.008281 0.005747 1.441 0.1499
## key 0.004152 0.005507 0.754 0.4510
## `energy:loudness` 0.031487 0.021201 1.485 0.1378
## `energy:acousticness` -0.019365 0.024429 -0.793 0.4281
## `loudness:acousticness` 0.064699 0.055231 1.171 0.2417
## `energy:loudness:acousticness` 0.004288 0.025755 0.167 0.8678
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1845 on 1123 degrees of freedom
## Multiple R-squared: 0.3435, Adjusted R-squared: 0.3359
## F-statistic: 45.19 on 13 and 1123 DF, p-value: < 2.2e-16
Let’s take a look at these plots.
Lets see how the model fits the data
#predicting the values of valence with the models
predictions <- predict(lm_spotify, billboard)
ggplot(billboard, aes(x = valence, y = predictions)) +
geom_point(color= "orange") +
geom_smooth(method = "lm") +
theme_minimal() +
labs(x = "Valence", y = "Predictions", title = "Predictions vs. Observations of Happiness in Music")
## `geom_smooth()` using formula 'y ~ x'